LR Schedules
Techniques for adjusting the learning rate over the course of training to improve optimization.
Learning Rate Warmup and Decay
The practice of gradually increasing the learning rate to its peak value at the start of training. This gives adaptive optimizers time to initialize and build up reliable gradient statistics before taking full-size steps. The theory that Adam-like optimizers work better with warmup was validated by the RAdam (2019) paper.
TODO: Loss Catapult theory
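A minimal sketch of the warmup ramp on its own, assuming 0-based steps; parameter names like `peak_lr` and `warmup_steps` are illustrative, not tied to any library. The schedules below reuse the same ramp before their decay phases.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps):
    """Ramp the LR linearly from ~0 to peak_lr over warmup_steps, then hold at peak."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```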
Inverse Square Root
LR is decayed proportionally to 1/√t: it starts at the maximum LR and halves by the 4th step, so the decay is aggressive early on.
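A sketch assuming 1-based steps and no warmup term (real-world variants usually fold in a warmup factor):

```python
import math

def inverse_sqrt_lr(step, peak_lr):
    """LR proportional to 1/sqrt(step): full peak_lr at step 1, halved by step 4."""
    return peak_lr / math.sqrt(max(step, 1))
```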
Linear Warmup + Cosine Decay
LR increases linearly to its peak, then follows a cosine curve down to near zero. This strategy was used in GPT-3, PaLM, and Llama 2 (2020-2023).
+ Effective when the total number of training steps is known in advance.
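A sketch with illustrative parameter names; note that `total_steps` must be known up front, which is why this schedule suits fixed-length runs.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr by total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```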
Warmup-Stable-Decay (WSD)
LR increases linearly, stays at its peak for a long period, then decays over a short final period. The decay can be linear or square-root shaped.
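A sketch of the three phases, assuming the phase lengths (`warmup_steps`, `stable_steps`, `decay_steps`) are chosen by the user; the square-root option is one possible reading of the "square rooted" decay above.

```python
import math

def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, decay="linear"):
    """Warmup-Stable-Decay: linear warmup, long flat phase at peak_lr, short final decay."""
    if step < warmup_steps:                    # warmup phase
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:     # stable phase
        return peak_lr
    # decay phase
    progress = min(1.0, (step - warmup_steps - stable_steps) / max(1, decay_steps))
    if decay == "linear":
        return peak_lr * (1.0 - progress)
    return peak_lr * (1.0 - math.sqrt(progress))  # square-root-shaped decay
```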
TODO: Schedule-Free Optimizers